Abstract

Consider a standard binary classification problem, in which $(X, Y)$
is a random couple in $\mathcal{X} \times \{0, 1\}$ and the training data consist of
$n$ i.i.d. copies of $(X, Y)$. Given a binary classifier $f : \mathcal{X} \to \{0, 1\}$, the
generalization error of $f$ is defined by $R(f) = P\{Y \neq f(X)\}$. Its minimum
$R^*$ over all binary classifiers $f$ is called the Bayes risk and is attained
at a Bayes classifier. The performance of any binary classifier $\hat f_n$ based
on the training data is characterized by the excess risk $R(\hat f_n) - R^*$. We
study Bahadur's type exponential bounds on the following minimax
accuracy confidence function based on the excess risk:

$$AC_n(\mathcal{M}, \lambda) = \inf_{\hat f_n} \sup_{P \in \mathcal{M}} P\big(R(\hat f_n) - R^* \ge \lambda\big), \quad \lambda \in [0, 1],$$

where the supremum is taken over all distributions $P$ of $(X, Y)$ from
a given class of distributions $\mathcal{M}$ and the infimum is over all binary
classifiers $\hat f_n$ based on the training data. We study how this quantity
depends on the complexity of the class of distributions $\mathcal{M}$, characterized
by exponents of entropies of the class of regression functions
or of the class of Bayes classifiers corresponding to the distributions
from $\mathcal{M}$. We also study its dependence on margin parameters of the
classification problem. In particular, we show that, in the case when
$\mathcal{X} = [0, 1]^d$ and $\mathcal{M}$ is the class of all distributions satisfying the margin
condition with exponent $\alpha > 0$ and such that the regression function
$\eta$ belongs to a given Hölder class of smoothness $\beta > 0$,

$$-\frac{\log AC_n(\mathcal{M}, \lambda)}{n} \asymp \lambda^{\frac{2+\alpha}{1+\alpha}}, \quad \lambda \in \big[D n^{-\frac{1+\alpha}{2+\alpha+d/\beta}}, \lambda_0\big],$$

for some constants $D, \lambda_0 > 0$.

1 Introduction

Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. We consider a random variable $(X, Y)$
in $\mathcal{X} \times \{0, 1\}$ with probability distribution denoted by $P$. Denote by $\mu_X$ the
marginal distribution of $X$ in $\mathcal{X}$ and by

$$\eta(x) = \eta_P(x) \triangleq P(Y = 1 \mid X = x) = E(Y \mid X = x)$$

the conditional probability of $Y = 1$ given $X = x$, which is also the regression
function of $Y$ on $X$. Assume that we have $n$ i.i.d. observations of the pair
$(X, Y)$ denoted by $\mathcal{D}_n = ((X_i, Y_i))_{i=1,\dots,n}$. The aim is to predict the output
label $Y$ for any input $X$ in $\mathcal{X}$ from the observations $\mathcal{D}_n$.

We recall some standard facts of classification theory. A prediction rule is
a measurable function $f : \mathcal{X} \to \{0, 1\}$. To any prediction rule we associate
the classification error (probability of misclassification):

$$R(f) \triangleq P(Y \neq f(X)).$$

It is well known (see, e.g., Devroye et al. [3]) that

$$\min_{f : \mathcal{X} \to \{0,1\}} R(f) = R(f^*) = R^*,$$

where the prediction rule $f^*$, called the Bayes rule, is defined by

$$f^*(x) = f_P^*(x) \triangleq I_{\{\eta(x) \ge 1/2\}}, \quad \forall x \in \mathcal{X},$$

where $I_A$ denotes the indicator function of $A$. The minimal risk $R^*$ is called
the Bayes risk. A classifier is a function, $\hat f_n = \hat f_n(X, \mathcal{D}_n)$, measurable with
respect to $\mathcal{D}_n$ and $X$ with values in $\{0, 1\}$, that assigns to the sample $\mathcal{D}_n$ a
prediction rule $\hat f_n(\cdot, \mathcal{D}_n) : \mathcal{X} \to \{0, 1\}$. A key characteristic of $\hat f_n$ is its risk
$E[R(\hat f_n)]$, where

$$R(\hat f_n) \triangleq P(Y \neq \hat f_n(X) \mid \mathcal{D}_n).$$

The aim of statistical learning is to construct a classifier $\hat f_n$ such that $R(\hat f_n)$ is
as close to $R^*$ as possible. The accuracy of a classifier $\hat f_n$ is usually measured
by the quantity $E[R(\hat f_n) - R^*]$ called the (expected) excess risk of $\hat f_n$, where
the expectation $E$ is taken with respect to the distribution of $\mathcal{D}_n$. We say
that the classifier $\hat f_n$ learns with the convergence rate $\psi(n)$ if there exists an
absolute constant $C > 0$ such that for any integer $n$, $E[R(\hat f_n) - R^*] \le C\psi(n)$.

Given a convergence rate, Theorem 7.2 of Devroye et al. [1] shows that
no classifier can learn with this rate for all underlying probability
distributions $P$. To achieve some rates of convergence, we need to restrict the class
of possible distributions $P$. For instance, Yang [18] provides examples of
classifiers learning with a given convergence rate under complexity assumptions
expressed via the smoothness properties of the regression function $\eta$. Under
complexity assumptions alone, no matter how strong they are, the rates
cannot be faster than $n^{-1/2}$ (cf. Devroye et al. [1]). Nevertheless, they can be
as fast as $n^{-1}$ if we add a control on the behavior of the regression function
$\eta$ at the level 1/2 (the distance $|\eta(\cdot) - 1/2|$ is sometimes called the margin).
This behavior is usually characterized by the following condition introduced
in [H].

Margin condition. The probability distribution $P$ on the space $\mathcal{X} \times \{0, 1\}$
satisfies the Margin condition with exponent $0 < \alpha < \infty$ if there exists $C_M > 0$
such that

$$\mu_X(0 < |\eta(X) - 1/2| \le t) \le C_M t^{\alpha}, \quad \forall\, 0 \le t < 1. \qquad (1)$$

Equivalently, one can assume that (1) holds only for $t \in [0, t_0]$ for some
$t_0 \in (0, 1)$. This would imply (1) for all $t \in [0, 1)$ (with a larger value of
$C_M$). In this form, (1) makes sense also for $\alpha = +\infty$; it is interpreted as
$\mu_X(0 < |\eta(X) - 1/2| \le t_0) = 0$, and it was used, e.g., in pJJ. Another
equivalent form of the margin condition (1) is discussed in the next section (see
(10)) and it is characterized by the margin parameter $\kappa = (1 + \alpha)/\alpha$ ($\kappa = 1$ for
$\alpha = +\infty$). Under the margin condition, fast rates, that is, rates faster than
$n^{-1/2}$, can be obtained for different classifiers, cf. Tsybakov [13], Blanchard
et al. [2], Bartlett et al. [5], Tsybakov and van de Geer [TB], Koltchinskii [H],
Massart and Nedelec [11], Audibert and Tsybakov [1], Scovel and Steinwart
[12] among others.
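
As an illustration of condition (1), the following Python sketch checks the margin bound numerically on a toy example that is not taken from the paper. It assumes that $X$ is uniform on $[0, 1]$ and that $\eta(x) = 1/2 + \frac{1}{2}\,\mathrm{sign}(x - 1/2)\,|2x - 1|^{1/\alpha}$; for this hypothetical choice one can compute by hand that $\mu_X(0 < |\eta(X) - 1/2| \le t) = (2t)^{\alpha}$ for $t \le 1/2$, so (1) holds with $C_M = 2^{\alpha}$. The function names and the grid size are illustrative assumptions.

```python
def eta(x, alpha):
    """Toy regression function on [0, 1] (illustrative assumption, not from the paper)."""
    u = x - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return 0.5 + 0.5 * sign * abs(2.0 * u) ** (1.0 / alpha)


def margin_mass(t, alpha, grid=100_000):
    """Approximate mu_X(0 < |eta(X) - 1/2| <= t) for X uniform on [0, 1]."""
    hits = 0
    for i in range(grid):
        x = (i + 0.5) / grid  # midpoint rule on a regular grid
        gap = abs(eta(x, alpha) - 0.5)
        if 0.0 < gap <= t:
            hits += 1
    return hits / grid


alpha = 2.0
c_margin = 2.0 ** alpha  # constant C_M that works for this toy eta
for t in (0.05, 0.1, 0.25, 0.5):
    # left: numerical mass of the margin region, right: the bound C_M * t^alpha
    print(t, margin_mass(t, alpha), c_margin * t ** alpha)
```

In this example the bound is attained with equality, which shows that the exponent $\alpha$ in (1) cannot be improved for this particular $\eta$.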

In this paper, we will study the closeness of $R(\hat f_n)$ to $R^*$ in a more
refined way. Our measure of performance is inspired by the Bahadur efficiency
of estimation procedures but, in contrast to the classical Bahadur
approach (cf., e.g., [7]), we obtain non-asymptotic results.

For a classifier $\hat f_n$ and for a tolerance $\lambda > 0$, define the accuracy confidence
function (or, shortly, the AC-function):

$$AC_n(\hat f_n, \lambda) = P\big(R(\hat f_n) - R^* \ge \lambda\big). \qquad (2)$$

Here $P$ denotes the probability distribution of the observed sample $\mathcal{D}_n$. Note
that $AC_n(\hat f_n, \lambda) = 0$ for $\lambda > 1$ since $0 \le R(f) \le 1$ for all classifiers $f$.
Moreover, $R(\hat f_n) - R^* \le 1/2$ for all interesting classifiers $\hat f_n$. Indeed, it makes
no sense to deal with probabilities of error $R(\hat f_n)$ greater than 1/2 (note
that $R(\hat f_n) = 1/2$ is achieved when $\hat f_n$ is the simple random guess classifier).
Therefore, without loss of generality we can consider only $\lambda \le 1/2$. In fact, we
will sometimes use a slightly stronger restriction $\lambda \le \lambda_0$ for some $\lambda_0 < 1/2$
independent of $n$.

It is intuitively clear that if the tolerance is low ($\lambda$ under some critical
value $\lambda_n$), the probability $AC_n(\hat f_n, \lambda)$ is kept larger than some fixed level. On
the contrary, for $\lambda \ge \lambda_n$, the quality of the procedure can be characterized
by the rate of convergence of $AC_n(\hat f_n, \lambda)$ towards zero as $n \to \infty$. Observe
that evaluating the critical value $\lambda_n$ yields, as a consequence, bounds and the
associated rates for the excess risk $E R(\hat f_n) - R^*$, which is a commonly used
measure of performance.

For a class $\mathcal{M}$ of probability measures $P$, we define the minimax
AC-function

$$AC_n(\mathcal{M}, \lambda) = \inf_{\hat f_n \in S_n} \sup_{P \in \mathcal{M}} P\big(R(\hat f_n) - R^* \ge \lambda\big), \qquad (3)$$

where $S_n$ is the set of all classifiers. We will consider classes $\mathcal{M} = \mathcal{M}(r, \alpha)$
defined by the following conditions:

(a) A margin assumption with exponent $\alpha$.

(b) A complexity assumption expressed in terms of the rate of decay $r > 0$
of an $\varepsilon$-entropy.

The main results of this paper can be summarized as follows. Fix $r, \alpha > 0$
and set $\lambda_n = D n^{-\frac{1+\alpha}{2+\alpha+r'}}$, where $D > 0$, and $r' = r'(\alpha, r) > 0$ is a function of $\alpha$
and $r$ depending on the type of the imposed complexity assumptions. Then,
we have an upper bound: there exist positive constants $C, c$ such that, for
all classes $\mathcal{M} = \mathcal{M}(r, \alpha)$ satisfying the above two conditions,

$$AC_n(\mathcal{M}, \lambda) \le C \exp\{-c n \lambda^{\frac{2+\alpha}{1+\alpha}}\}, \quad \forall \lambda \ge \lambda_n. \qquad (4)$$

Furthermore, we prove the corresponding lower bound: there exists a class $\mathcal{M}$
satisfying the same conditions (a) and (b) such that

$$AC_n(\mathcal{M}, \lambda) \ge p_0, \quad 0 < \lambda \le \lambda_n^- \asymp \lambda_n, \qquad (5)$$
$$AC_n(\mathcal{M}, \lambda) \ge C' \exp\{-c' n \lambda^{\frac{2+\alpha}{1+\alpha}}\}, \quad \lambda_n \asymp \lambda_n^+ \le \lambda \le \lambda_0 \qquad (6)$$

for some positive constants $p_0$, $C'$, $c'$ and $0 < \lambda_0 < 1/2$ depending only
on $C_M$ and $\alpha$. Thus, we quantify the critical level phenomenon discussed
above and we derive the exact exponential rate $\exp\{-c n \lambda^{\frac{2+\alpha}{1+\alpha}}\}$ for the minimax
AC-function over the critical level. In particular, this implies the following
bounds on the minimax AC-function in the case when $\mathcal{X} = [0, 1]^d$ and $\mathcal{M}$
is the class of all distributions satisfying the margin condition with exponent
$\alpha > 0$ and such that the regression function $\eta$ belongs to a Hölder class of
smoothness $\beta > 0$ (see Section 5):

$$AC_n(\mathcal{M}, \lambda) \ge p_0, \quad 0 < \lambda \le D_1 n^{-\frac{1+\alpha}{2+\alpha+d/\beta}},$$
$$C' \exp\{-c' n \lambda^{\frac{2+\alpha}{1+\alpha}}\} \le AC_n(\mathcal{M}, \lambda) \le C \exp\{-c n \lambda^{\frac{2+\alpha}{1+\alpha}}\}, \quad D_2 n^{-\frac{1+\alpha}{2+\alpha+d/\beta}} \le \lambda \le \lambda_0.$$
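
To get a feeling for the two regimes described above, here is a small Python sketch that evaluates the critical level $D n^{-(1+\alpha)/(2+\alpha+r')}$ and the exponential upper bound of the form (4). The paper does not specify the constants $D$, $C$, $c$, so the default values of 1 used below are placeholders chosen purely for illustration, and the function names are hypothetical.

```python
import math


def ac_exponent(alpha):
    """Power of lambda in the exponential bounds: (2 + alpha) / (1 + alpha)."""
    return (2.0 + alpha) / (1.0 + alpha)


def critical_level(n, alpha, r_prime, D=1.0):
    """Critical tolerance D * n^{-(1+alpha)/(2+alpha+r')}; D=1 is a placeholder."""
    return D * n ** (-(1.0 + alpha) / (2.0 + alpha + r_prime))


def ac_upper_bound(n, lam, alpha, C=1.0, c=1.0):
    """Shape of the upper bound (4); C=c=1 are placeholder constants."""
    return C * math.exp(-c * n * lam ** ac_exponent(alpha))


# Hoelder class of smoothness beta on [0, 1]^d: r' = d / beta.
d, beta, alpha, n = 2, 1.0, 1.0, 10_000
lam_n = critical_level(n, alpha, r_prime=d / beta)
print("critical level:", lam_n)
print("bound at 2 * critical level:", ac_upper_bound(n, 2 * lam_n, alpha))
print("bound at 10 * critical level:", ac_upper_bound(n, 10 * lam_n, alpha))
```

The exponent $(2+\alpha)/(1+\alpha)$ equals 2 at $\alpha = 0$ and decreases to 1 as $\alpha \to \infty$, so a stronger margin condition gives both a smaller critical level and a faster exponential decay above it.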

As an immediate consequence of (4) - (6) we get the minimax rate for
the excess risk:

$$\inf_{\hat f_n \in S_n} \sup_{P \in \mathcal{M}} \big[E R(\hat f_n) - R^*\big] \asymp n^{-\frac{1+\alpha}{2+\alpha+r'}} \qquad (7)$$

for appropriate classes $\mathcal{M}$, which implies the results previously obtained in
Tsybakov [14] and Audibert and Tsybakov [1].

It is interesting to compare (4) - (6) to the results for the regression
problem in a similar setting (see DeVore et al. [5] and Temlyakov [13]) since there
are similarities and differences. Let us quote these former results: suppose,
in a supervised learning setting, that we observe $n$ i.i.d. observations of the
pair $(X, Y)$, but here $Y$ is valued in $[-M, M]$ instead of $\{0, 1\}$ and we want to
estimate

$$\xi(x) = E(Y \mid X = x).$$

Let $\hat\xi_n(x)$ denote an estimator of $\xi(x)$ and consider the loss

$$\|\hat\xi_n - \xi\|_{L_2(\mu_X)}.$$

Here and in what follows, $\|\cdot\|_{L_p(\mu_X)}$, $p \ge 1$, denotes the $L_p(\mu_X)$-norm with
respect to the measure $\mu_X$ on $\mathcal{X}$. In this context, $AC_n(\mathcal{M}, \lambda)$ denotes the
quantity

$$\inf_{\hat\xi_n} \sup_{P \in \mathcal{M}} P\big(\|\hat\xi_n - \xi\|_{L_2(\mu_X)} \ge \lambda\big).$$

It is proved in [5] and [13] that if $\mathcal{M} = \mathcal{M}(\Theta, \mu_X)$ is the set of probability
measures having $\mu_X$ as marginal distribution and such that $\xi$ belongs to the
set $\Theta$, and the entropy numbers of $\Theta$ with respect to $L_2(\mu_X)$ are of order
$n^{-r}$ (see [5] and [13] for details), then there exist $\lambda_n^-$, $\lambda_n^+$, with $\lambda_n^- \asymp \lambda_n^+ \asymp
n^{-r/(1+2r)}$, and constants $\delta_0, C_1, c_1, C_2, c_2$ such that

$$AC_n(\mathcal{M}(\Theta, \mu_X), \lambda) \ge \delta_0, \quad \forall \lambda \le \lambda_n^-, \qquad (8)$$
$$C_1 e^{-c_1 n \lambda^2} \le AC_n(\mathcal{M}(\Theta, \mu_X), \lambda) \le C_2 e^{-c_2 n \lambda^2}, \quad \forall \lambda \ge \lambda_n^+. \qquad (9)$$

These inequalities describe accurately the behavior of the minimax
AC-function for classes $\mathcal{M}(\Theta, \mu_X)$ with any marginal distribution $\mu_X$. The same
inequalities hold for the following quantity

$$\sup_{\mu_X} AC_n(\mathcal{M}(\Theta, \mu_X), \lambda).$$

Our results for the classification problem are somewhat weaker than the
above results for the regression problem. In Sections 3 and 4 we prove
the upper bounds for the corresponding classes in the case of any marginal
distribution $\mu_X$ such that the Margin assumption holds. This is analogous
to what was obtained for the regression problem. However, in Section 5.4
we only prove the matching lower bounds for a special marginal distribution
$\mu_X$. Thus we obtain an accurate description of the behavior of the supremum
over marginal distributions $\sup_{\mu_X} AC_n(\mathcal{M}, \lambda)$ and not of the individual
AC-functions for each marginal distribution $\mu_X$.

The similarity of the results in the two different settings is that there is
a regime of exponential concentration, which holds for any $\lambda$ greater than a
critical level. This critical level, which is also the minimax rate, depends on
the complexity of the class characterized by $r$. We can also observe that the
exponents in the bounds ($\frac{2+\alpha}{1+\alpha}$ in classification, 2 in regression) do not depend
on the complexity parameter $r$.

The differences are twofold, since the margin condition enters at two
levels. The first one is the critical value itself, $n^{-\frac{1+\alpha}{2+\alpha+r'}}$. Note
that here $\alpha$ appears in a favorable way (the larger it is, the better the
rate). This is intuitively clear since larger $\alpha$ corresponds to sharper decision
boundaries.

The second place where a difference occurs is the rate in the exponent,
$\lambda^{\frac{2+\alpha}{1+\alpha}}$ compared to $\lambda^2$ in a regression setting. The margin condition influences
the rate, and this time again in a favorable way with respect to $\alpha$ (the
rate improves as $\alpha$ grows). For $\alpha \to 0$, that is, when there is no margin
condition, we approach the same rate as in regression.



2 Properties related to the Margin condition 

In this section, we discuss some facts related to the Margin condition. We
first recall that it can be equivalently defined in the following way, cf. [Tij.

Proposition 1. A probability measure $P$ satisfies the Margin condition (1)
if and only if there exists a positive constant $c_M$ such that, for any Borel set
$G \subset \mathcal{X}$,

$$\int_G |2\eta(x) - 1| \, \mu_X(dx) \ge c_M \mu_X(G)^{\kappa}, \qquad (10)$$

where $\kappa = (1 + \alpha)/\alpha$.

Proof: Let $G$ be given. Clearly, it suffices to assume that $\mu_X(G) > 0$.
Choose $t$ from the equation $\mu_X(G) = 2 C_M t^{\alpha}$. Then by the Margin condition

$$\mu_X\big(G \setminus \{0 < |\eta(X) - 1/2| \le t\}\big) \ge \mu_X(G) - C_M t^{\alpha} \ge C_M t^{\alpha}.$$

Therefore,

$$\int_G |2\eta(x) - 1| \, \mu_X(dx) \ge 2 \int_{G \setminus \{x : 0 < |\eta(x) - 1/2| \le t\}} t \, \mu_X(dx) \ge 2 C_M t^{1+\alpha} = (2 C_M)^{-1/\alpha} \mu_X(G)^{1 + 1/\alpha}. \qquad (11)$$

Conversely, assume that for some $\kappa > 1$ inequality (10) holds for any Borel
set $G$. Take $G = \{x : 0 < |\eta(x) - 1/2| \le t\}$. Then (10) yields

$$\mu_X(0 < |\eta(X) - 1/2| \le t) \le \Big( c_M^{-1} \int_{0 < |\eta(x) - 1/2| \le t} |2\eta(x) - 1| \, \mu_X(dx) \Big)^{1/\kappa} \le \big( 2 c_M^{-1} t \, \mu_X(0 < |\eta(X) - 1/2| \le t) \big)^{1/\kappa}.$$

Solving this inequality with respect to $\mu_X(0 < |\eta(X) - 1/2| \le t)$ we obtain
the Margin condition (1). □

Remark 1. The constant $C_M$ in the Margin condition (1) satisfies

$$C_M \ge 1/2.$$

Proof: By (11) we have that (10) holds with constant $c_M = (2 C_M)^{-1/\alpha}$.
Using this and the fact that $0 \le \eta(x) \le 1$ we get $\mu_X(G) \ge c_M \mu_X(G)^{\kappa} =
(2 C_M)^{-1/\alpha} \mu_X(G)^{\kappa}$ for all $G \subset \mathcal{X}$. Thus, $2 C_M \ge \mu_X(G)$, and since this
holds for all $G$ and $\mu_X$ is a probability measure we get the result.

Remark 2. The statement of Proposition 1 also holds with $\kappa = 1$ for the
case $\alpha = +\infty$, which is understood as discussed after the definition of the Margin
condition (1).

We now state an easy consequence of Proposition 1.

Lemma 1. If the probability measure $P$ satisfies the Margin condition (1),
then for any prediction rule $f$,

$$R(f) - R^* \ge (2 C_M)^{-1/\alpha} \|f - f_P^*\|_{L_1(\mu_X)}^{(1+\alpha)/\alpha}.$$

Analogously, if the probability measure $P$ satisfies the Margin condition (10)
with some $\kappa \ge 1$, then for any prediction rule $f$,

$$R(f) - R^* \ge c_M \|f - f_P^*\|_{L_1(\mu_X)}^{\kappa}.$$

Proof: Note that, for any prediction rule $f$,

$$R(f) - R^* = \int_{D_P(f)} |2\eta(x) - 1| \, \mu_X(dx), \qquad (12)$$

where $D_P(f) = \{x : f_P^*(x) \neq f(x)\}$. By (11) we have that (10) holds with
constant $c_M = (2 C_M)^{-1/\alpha}$. Thus, the result follows from (10) and the obvious
relation

$$\mu_X(D_P(f)) = \|f - f_P^*\|_{L_1(\mu_X)}.$$

Finally, we will use the following property.

Proposition 2. For any Borel function $\bar\eta : \mathcal{X} \to [0, 1]$ and any distribution
$P$ of $(X, Y)$ satisfying the Margin condition (1), we have

$$\|f_{\bar\eta} - f_P^*\|_{L_1(\mu_X)} \le 2 C_M \|\bar\eta - \eta_P\|_{L_\infty(\mu_X)}^{\alpha},$$

where $f_{\bar\eta}(x) = I_{\{\bar\eta(x) \ge 1/2\}}$.

Proof: By Lemma 5.1 in [1],

$$R(f_{\bar\eta}) - R^* \le 2 C_M \|\bar\eta - \eta_P\|_{L_\infty(\mu_X)}^{1+\alpha}. \qquad (13)$$

This and Lemma 1 yield the result.

Corollary 1. Let $\mathcal{P}$ be a class of joint distributions of $(X, Y)$ satisfying the
Margin condition (1) and all having the same marginal $\mu_X$. Then, for any
pair $P, \bar P \in \mathcal{P}$ with the corresponding regression functions $\eta, \bar\eta$ and decision
rules $f_\eta(x) = I_{\{\eta(x) \ge 1/2\}}$, $f_{\bar\eta}(x) = I_{\{\bar\eta(x) \ge 1/2\}}$, we have

$$\|f_\eta - f_{\bar\eta}\|_{L_1(\mu_X)} \le 2 C_M \|\eta - \bar\eta\|_{L_\infty(\mu_X)}^{\alpha}.$$



3 Upper bound under complexity assumption 
on the regression function 

In this section, we prove an upper bound of the form (4) for a class of
probability distributions $P$, for which the complexity assumption (b) (cf. the
Introduction) is expressed in terms of the entropy of the class of underlying
regression functions $\eta_P$.

For $g : \mathcal{X} \to \mathbb{R}$, define the sup-norm $\|g\|_\infty = \sup_{x \in \mathcal{X}} |g(x)|$.

Fix some positive constants $r, \alpha, C_M, B$. Let $\mathcal{M}(r, \alpha) = \mathcal{M}(r, \alpha, C_M, B)$
be any set of joint distributions $P$ of $(X, Y)$ satisfying the following two
conditions.

(i) The Margin condition (1) with exponent $\alpha$ and constant $C_M$.

(ii) The regression function $\eta = \eta_P$ belongs to a known class of functions $\mathcal{U}$,
which admits the entropy bound

$$\mathcal{H}(\varepsilon, \mathcal{U}, \|\cdot\|_\infty) \le B \varepsilon^{-r}, \quad \forall \varepsilon > 0. \qquad (14)$$

Here, the $\varepsilon$-entropy $\mathcal{H}(\varepsilon, \mathcal{U}, \|\cdot\|_\infty)$ is defined as the natural logarithm of
the minimal number of $\varepsilon$-balls in the $\|\cdot\|_\infty$ norm needed for covering $\mathcal{U}$.

For any prediction rule $f$, we define the empirical risk

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} I_{\{f(X_i) \neq Y_i\}}.$$

We consider the classifier $\hat f_{n,1}(x) = I_{\{\hat\eta_n(x) \ge 1/2\}}$, where

$$\hat\eta_n = \mathrm{argmin}_{\eta' \in \mathcal{N}_\varepsilon} R_n(f_{\eta'}).$$

Here $f_{\eta'}(x) = I_{\{\eta'(x) \ge 1/2\}}$ and $\mathcal{N}_\varepsilon$ denotes a minimal $\varepsilon$-net on $\mathcal{U}$ in the $\|\cdot\|_\infty$
norm, i.e., $\mathcal{N}_\varepsilon$ is the minimal subset of $\mathcal{U}$ such that the union of $\varepsilon$-balls in
the $\|\cdot\|_\infty$ norm centered at the elements of $\mathcal{N}_\varepsilon$ covers $\mathcal{U}$.
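
The construction of $\hat f_{n,1}$ can be sketched in Python on a toy one-parameter class. Everything specific below is an assumption made for illustration and does not come from the paper: the class $\mathcal{U} = \{\eta_c(x) = \mathrm{clip}(1/2 + (x - c), 0, 1) : c \in [0, 1]\}$ on $\mathcal{X} = [0, 1]$, the simulated sample, and the values $\alpha = 1$, $r = 1$. Since $\|\eta_c - \eta_{c'}\|_\infty \le |c - c'|$, a grid of centers with spacing $2\varepsilon$ is an $\varepsilon$-net in the sup-norm (not necessarily a minimal one), and $\varepsilon = n^{-1/(2+\alpha+r)}$ as in Theorem 1.

```python
import random


def eta_c(c):
    """Toy regression function eta_c(x) = clip(1/2 + (x - c), 0, 1) (assumption)."""
    return lambda x: min(1.0, max(0.0, 0.5 + (x - c)))


def plug_in_rule(eta):
    """Prediction rule f_eta(x) = I{eta(x) >= 1/2}."""
    return lambda x: 1 if eta(x) >= 0.5 else 0


def eps_net(eps):
    """Centers c with spacing 2*eps covering [0, 1]: an eps-net for the toy class."""
    centers = []
    c = eps
    while c - eps < 1.0:
        centers.append(min(c, 1.0))
        c += 2.0 * eps
    return centers


def empirical_risk(rule, sample):
    """R_n(f): fraction of misclassified observations."""
    return sum(1 for x, y in sample if rule(x) != y) / len(sample)


def erm_over_net(sample, alpha, r):
    """Minimize the empirical risk over the eps-net with eps = n^{-1/(2+alpha+r)}."""
    n = len(sample)
    eps = n ** (-1.0 / (2.0 + alpha + r))
    net = eps_net(eps)
    best_c = min(net, key=lambda c: empirical_risk(plug_in_rule(eta_c(c)), sample))
    return best_c, net


# Simulated sample: X uniform on [0, 1], P(Y = 1 | X = x) = eta_{0.3}(x).
rng = random.Random(0)
true_eta = eta_c(0.3)
sample = []
for _ in range(2000):
    x = rng.random()
    y = 1 if rng.random() < true_eta(x) else 0
    sample.append((x, y))

best_c, net = erm_over_net(sample, alpha=1.0, r=1.0)
print("net centers:", [round(c, 3) for c in net])
print("selected center:", round(best_c, 3))
```

Note that the net depends on $\alpha$ through $\varepsilon$, so this procedure needs the margin exponent as an input.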

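Theorem 1. Let $r, \alpha, C_M, B$ be finite positive constants. Set $\varepsilon = \varepsilon_n =
n^{-\frac{1}{2+\alpha+r}}$. Then there exist positive constants $c$ and $c'$ depending only on
$r, \alpha, C_M, B$ such that

$$\sup_{P \in \mathcal{M}(r, \alpha)} P\{R(\hat f_{n,1}) - R(f_P^*) \ge \lambda\} \le 2 \exp\{-c n \lambda^{\frac{2+\alpha}{1+\alpha}}\}$$

for $\lambda \ge c' n^{-\frac{1+\alpha}{2+\alpha+r}}$.

This theorem has an immediate consequence in terms of AC-functions.

Corollary 2. There exist $d > 0$, $c > 0$ such that for $\lambda_n = d n^{-\frac{1+\alpha}{2+\alpha+r}}$ we have

$$AC_n(\mathcal{M}(r, \alpha), \lambda) \le 2 e^{-c n \lambda^{\frac{2+\alpha}{1+\alpha}}}, \quad \forall \lambda \ge \lambda_n. \qquad (15)$$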



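Proof of Theorem 1. Set $d(\eta') = R(f_{\eta'}) - R(f_P^*)$. Let $\bar\eta \in \mathcal{N}_\varepsilon$ be such
that $\|\bar\eta - \eta_P\|_\infty \le \varepsilon$. Using (13) we get

$$d(\bar\eta) = R(f_{\bar\eta}) - R^* \le 2 C_M \|\bar\eta - \eta_P\|_\infty^{1+\alpha} \le 2 C_M \varepsilon^{1+\alpha} \le \lambda/2 \qquad (16)$$

for any $\lambda \ge 4 C_M n^{-\frac{1+\alpha}{2+\alpha+r}}$. Define a set of functions $\mathcal{G}_\varepsilon = \{\eta' \in \mathcal{N}_\varepsilon : d(\eta') \ge \lambda\}$,
and introduce the centered empirical increments

$$Z_n(\eta') = \big(R_n(f_{\eta'}) - R_n(f_P^*)\big) - \big(R(f_{\eta'}) - R(f_P^*)\big).$$

Then

$$P\big(R(\hat f_{n,1}) - R(f_P^*) \ge \lambda\big) \le P\big(\exists \eta' \in \mathcal{G}_\varepsilon : R_n(f_{\eta'}) - R_n(f_{\bar\eta}) \le 0\big)$$
$$\le \sum_{\eta' \in \mathcal{G}_\varepsilon} P\big(d(\eta') + Z_n(\eta') - d(\bar\eta) - Z_n(\bar\eta) \le 0\big).$$

Note that for any $\eta' \in \mathcal{G}_\varepsilon$ we have

$$d(\eta') - d(\bar\eta) \ge d(\eta')/2 \ge \lambda/2.$$

Using this remark and (14) we find

$$P\big(R(\hat f_{n,1}) - R(f_P^*) \ge \lambda\big) \le \sum_{\eta' \in \mathcal{G}_\varepsilon} P\big(Z_n(\eta') \le -d(\eta')/4\big) + P\big(Z_n(\bar\eta) \ge \lambda/4\big) \qquad (17)$$
$$\le \exp(B \varepsilon^{-r}) \max_{\eta' \in \mathcal{G}_\varepsilon} P\big(Z_n(\eta') \le -d(\eta')/4\big) + P\big(Z_n(\bar\eta) \ge \lambda/4\big).$$

Now, $Z_n(\eta') = \frac{1}{n} \sum_{i=1}^{n} \xi_i(\eta')$, where

$$\xi_i(\eta') = I_{\{f_{\eta'}(X_i) \neq Y_i\}} - I_{\{f_P^*(X_i) \neq Y_i\}} - E\big[I_{\{f_{\eta'}(X_i) \neq Y_i\}} - I_{\{f_P^*(X_i) \neq Y_i\}}\big].$$

Clearly, $|\xi_i(\eta')| \le 2$ and, by Lemma 1,

$$E\big(\xi_i(\eta')^2\big) \le E\big(\big[I_{\{f_{\eta'}(X_i) \neq Y_i\}} - I_{\{f_P^*(X_i) \neq Y_i\}}\big]^2\big) = \|f_{\eta'} - f_P^*\|_{L_1(\mu_X)} \le (2 C_M)^{\frac{1}{1+\alpha}} d^{\frac{\alpha}{1+\alpha}}(\eta').$$

Therefore, we can apply Bernstein's inequality to get

$$P\big(Z_n(\eta') \le -d(\eta')/4\big) \le \exp\Big(-\frac{n d^2(\eta')/16}{2\big((2 C_M)^{\frac{1}{1+\alpha}} d^{\frac{\alpha}{1+\alpha}}(\eta') + d(\eta')/3\big)}\Big) \le \exp\Big(-\frac{n d^{\frac{2+\alpha}{1+\alpha}}(\eta')}{c_1}\Big),$$

where $c_1 = 32\big((2 C_M)^{\frac{1}{1+\alpha}} + 1/3\big)$ and we used that $d(\eta') \le d^{\frac{\alpha}{1+\alpha}}(\eta')$ since
$d(\eta') \le 1$. Thus, for any $\eta' \in \mathcal{G}_\varepsilon$ we obtain

$$P\big(Z_n(\eta') \le -d(\eta')/4\big) \le \exp\big(-n \lambda^{\frac{2+\alpha}{1+\alpha}} / c_1\big).$$

As a consequence,

$$\exp(B \varepsilon^{-r}) \max_{\eta' \in \mathcal{G}_\varepsilon} P\big(Z_n(\eta') \le -d(\eta')/4\big) \le \exp\big(B n^{\frac{r}{2+\alpha+r}} - n \lambda^{\frac{2+\alpha}{1+\alpha}} / c_1\big) \le \exp\big(-n \lambda^{\frac{2+\alpha}{1+\alpha}} / (2 c_1)\big), \qquad (18)$$

where we used that $\lambda \ge c' n^{-\frac{1+\alpha}{2+\alpha+r}}$ for some large enough $c' > 0$. Another
application of Bernstein's inequality and (16) yields

$$P\big(Z_n(\bar\eta) \ge \lambda/4\big) \le \exp\Big(-\frac{n \lambda^2/16}{2\big((2 C_M)^{\frac{1}{1+\alpha}} d^{\frac{\alpha}{1+\alpha}}(\bar\eta) + \lambda/3\big)}\Big) \le \exp\Big(-\frac{n \lambda^2}{c_2 \big(\lambda^{\frac{\alpha}{1+\alpha}} + \lambda\big)}\Big).$$

For $\lambda \le 1$ the last inequality implies

$$P\big(Z_n(\bar\eta) \ge \lambda/4\big) \le \exp\Big(-\frac{n \lambda^{\frac{2+\alpha}{1+\alpha}}}{2 c_2}\Big).$$

This, together with (17) and (18), yields the result of the theorem for $\lambda \le 1$. If
$\lambda > 1$ it holds trivially since $d(\eta') \le 1$ for all $\eta'$.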
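
The role of Bernstein's inequality in the proof can be checked numerically. The Python sketch below uses the standard form of the inequality for i.i.d. centered variables bounded by `bound` with variance at most `variance`; the constants in this standard form may differ slightly from those displayed in the proof, and the values of $C_M$, $\alpha$, $n$ are arbitrary illustrative choices. It verifies that, with the variance bound $(2 C_M)^{1/(1+\alpha)} d^{\alpha/(1+\alpha)}$ and deviation $d/4$, the tail decays at least like $\exp\{-n d^{(2+\alpha)/(1+\alpha)}/c\}$ for a constant $c$ free of $d$.

```python
import math


def bernstein_tail(n, t, variance, bound):
    """Standard Bernstein bound on P(mean of n centered i.i.d. terms >= t)."""
    return math.exp(-n * t * t / (2.0 * (variance + bound * t / 3.0)))


def margin_variance(d, alpha, c_margin):
    """Variance bound (2 C_M)^{1/(1+alpha)} * d^{alpha/(1+alpha)} from Lemma 1."""
    return (2.0 * c_margin) ** (1.0 / (1.0 + alpha)) * d ** (alpha / (1.0 + alpha))


alpha, c_margin, n = 1.0, 1.0, 5000  # illustrative values only
gamma = (2.0 + alpha) / (1.0 + alpha)
# Constant obtained by bounding d <= d^{alpha/(1+alpha)} for d <= 1, with bound = 2.
c_const = 32.0 * ((2.0 * c_margin) ** (1.0 / (1.0 + alpha)) + 1.0 / 6.0)
for d in (0.05, 0.1, 0.3, 1.0):
    tail = bernstein_tail(n, d / 4.0, margin_variance(d, alpha, c_margin), bound=2.0)
    print(d, tail, math.exp(-n * d ** gamma / c_const))
```

The first printed column of bounds never exceeds the second, which is exactly the simplification step used to pass from the Bernstein bound to the exponent $\lambda^{(2+\alpha)/(1+\alpha)}$.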



4 Upper bound under complexity assumption 
on the Bayes classifier 

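In this section, we prove a result analogous to that of Section 3 when the
complexity assumption (b) (cf. the Introduction) is expressed in terms of the
entropy of the class of underlying Bayes classifiers $f_P^*$ rather than of that of
regression functions $\eta_P$.

First, introduce some definitions. Let $\mathcal{F}$ be a class of measurable functions
from a measurable space $(S, \mathcal{A}_S, \mu)$ into $[0, 1]$. Here $\mu$ is a $\sigma$-finite measure.
For $1 \le q \le \infty$ and $\varepsilon > 0$, let $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_q(\mu)})$ denote the $L_q(\mu)$-bracketing
numbers of $\mathcal{F}$. That is, $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_q(\mu)})$ is the minimal number
$N$ of functional brackets

$$[f_j^-, f_j^+] \triangleq \{g : f_j^- \le g \le f_j^+\}, \quad j = 1, \dots, N,$$

such that

$$\mathcal{F} \subset \bigcup_{j=1}^{N} [f_j^-, f_j^+] \quad \text{and} \quad \|f_j^+ - f_j^-\|_{L_q(\mu)} \le \varepsilon, \quad j = 1, \dots, N.$$

The bracketing $\varepsilon$-entropy of $\mathcal{F}$ in the $\|\cdot\|_{L_q(\mu)}$-norm is defined by

$$\mathcal{H}_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_q(\mu)}) \triangleq \log N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_q(\mu)}).$$

We will consider a class of probability distributions $P$ of $(X, Y)$
characterized by the complexity of the corresponding Bayes classifiers.
Specifically, fix some $\rho \in (0, 1)$, $0 < \alpha \le \infty$, $c_M > 0$, $c_\mu > 0$, $B' > 0$, and let
$\mathcal{M}^*(\rho, \alpha) = \mathcal{M}^*(\rho, \alpha, c_\mu, B')$ be any set of joint distributions $P$ of $(X, Y)$
satisfying the following conditions.

(i) The marginal distribution $\mu_X$ of $X$ is absolutely continuous with respect
to a $\sigma$-finite measure $\mu$ on $(\mathcal{X}, \mathcal{A})$, and $(d\mu_X/d\mu)(x) \le c_\mu$ for $\mu$-almost
all $x \in \mathcal{X}$.

(ii) The Margin condition (10) with exponent $\kappa = (1 + \alpha)/\alpha$ and
constant $c_M$ is satisfied (we adopt the convention that $\kappa = 1$ corresponds
to $\alpha = \infty$).

(iii) The Bayes classifier $f_P^*$ belongs to a known class of prediction rules $\mathcal{F}$
satisfying the bracketing entropy bound

$$\mathcal{H}_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_1(\mu)}) \le B' \varepsilon^{-\rho}, \quad \forall \varepsilon > 0. \qquad (19)$$

The results below still hold in this slightly more general situation.
We consider a classifier $\hat f_{n,2}$ that minimizes the empirical risk over the
class $\mathcal{F}$:

$$\hat f_{n,2} = \mathrm{argmin}_{f \in \mathcal{F}} R_n(f).$$

The main result of this section is that for $\hat f_{n,2}$ we have the following
exponential upper bound.

Theorem 2. Let $\rho \in (0, 1)$, $0 < \alpha \le \infty$, and let $c_M, c_\mu, B'$ be positive
constants. Then there exist positive constants $c$ and $c'$ depending only on
$\rho, \alpha, c_\mu, B'$ such that

$$\sup_{P \in \mathcal{M}^*(\rho, \alpha)} P\{R(\hat f_{n,2}) - R(f_P^*) \ge \lambda\} \le e \exp\{-c n \lambda^{\frac{2+\alpha}{1+\alpha}}\}$$

for $\lambda \ge c' n^{-\frac{1+\alpha}{2+\alpha(1+\rho)}}$, where for $\alpha = \infty$ the exponents are understood as their
limits, $\frac{2+\alpha}{1+\alpha} = 1$ and $\frac{1+\alpha}{2+\alpha(1+\rho)} = \frac{1}{1+\rho}$.

We deduce Theorem 2 from the following fact that we state here as a
proposition.

Proposition 3. There exists a constant $C_* > 0$ depending only on $\rho, \alpha, c_M$
such that, for all $t > 0$,

$$\sup_{P \in \mathcal{M}^*(\rho, \alpha)} P\Big\{R(\hat f_{n,2}) - R(f_P^*) \ge C_* \Big[ n^{-\frac{\kappa}{2\kappa - 1 + \rho}} \vee \Big(\frac{t}{n}\Big)^{\frac{\kappa}{2\kappa - 1}} \Big]\Big\} \le e^{1-t}.$$

It is easy to see that Theorem 2 follows from this proposition by taking
$t = c n \lambda^{\frac{2+\alpha}{1+\alpha}}$ with $\lambda \ge c' n^{-\frac{1+\alpha}{2+\alpha(1+\rho)}}$ for some constants $c, c' > 0$, and using
that $\kappa = \frac{1+\alpha}{\alpha}$ if $\alpha < \infty$.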

Proposition 3 will be derived from a general excess risk bound in abstract
empirical risk minimization ([10], Theorem 4.3). We will state this result
here for completeness. To this end, we need to introduce some notation. Let
$\mathcal{G}$ be a class of measurable functions from a probability space $(S, \mathcal{A}_S, P)$ into
$[0, 1]$ and let $Z_1, \dots, Z_n$ be i.i.d. copies of an observation $Z$ sampled from
$P$. For any probability measure $P$ and any $g \in \mathcal{G}$, introduce the following
notation for the expectation:

$$Pg = \int_S g \, dP.$$

Denote by $P_n$ the empirical measure based on $(Z_1, \dots, Z_n)$, and consider the
minimizer of empirical risk

$$\hat g_n \triangleq \mathrm{argmin}_{g \in \mathcal{G}} P_n g.$$

For a function $g \in \mathcal{G}$, define the excess risk

$$\mathcal{E}_P(g) \triangleq Pg - \inf_{g' \in \mathcal{G}} Pg'.$$

The set

$$\mathcal{F}_P(\delta) \triangleq \{g \in \mathcal{G} : \mathcal{E}_P(g) \le \delta\}$$

is called the $\delta$-minimal set. The size of such a set will be controlled in terms
of its $L_2(P)$-diameter

$$D(\delta) \triangleq \sup_{g, g' \in \mathcal{F}_P(\delta)} \|g - g'\|_{L_2(P)}$$

and also in terms of the following "localized empirical complexity":

$$\phi_n(\delta) \triangleq E \sup_{g, g' \in \mathcal{F}_P(\delta)} |(P_n - P)(g - g')|.$$

We will use these complexity measures to construct an upper confidence
bound on the excess risk $\mathcal{E}_P(\hat g_n)$. For a function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, define

$$\psi^{\flat}(\delta) \triangleq \sup_{\sigma \ge \delta} \frac{\psi(\sigma)}{\sigma}.$$

Let $V_n^t(\delta)$, $\delta > 0$, $t > 0$, be the function built from $\phi_n^{\flat}$ and $(D^2)^{\flat}$ as in
Theorem 4.3 of [10], and define

$$\sigma_n^t \triangleq \inf\{\sigma : V_n^t(\sigma) \le 1\}.$$

The following result is the first bound of Theorem 4.3 in [10].

Proposition 4. For all $t > 0$,

$$P\{\mathcal{E}_P(\hat g_n) \ge \sigma_n^t\} \le e^{1-t}.$$
In addition to this, we will use the well-known inequality for the expected
sup-norm of the empirical process in terms of bracketing entropy, see
Theorem 2.14.2 in [17]. More precisely, we will need the following simplified
version of that result.

Lemma 2. Let $\mathcal{T}$ be a class of functions from $S$ into $[0, 1]$ such that $\|g\|_{L_2(P)} \le a$
for all $g \in \mathcal{T}$. Assume that $\mathcal{H}_{[\,]}(a, \mathcal{T}, \|\cdot\|_{L_2(P)}) + 1 \le n a^2$. Then

$$E \sup_{g \in \mathcal{T}} |P_n g - Pg| \le \frac{C}{\sqrt{n}} \int_0^a \big(\mathcal{H}_{[\,]}(\varepsilon, \mathcal{T}, \|\cdot\|_{L_2(P)}) + 1\big)^{1/2} d\varepsilon,$$

where $C > 0$ is a universal constant.

Proof of Proposition 3. Note that, if $t > n$, then $(t/n)^{\frac{\kappa}{2\kappa - 1}} > 1$, and
the result holds trivially with $C_* = 1$ since $R(\hat f_{n,2}) - R(f_P^*) \le 1$. Thus, it is
enough to consider the case $t \le n$.

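Let $S = \mathcal{X} \times \{0, 1\}$ and $P$ be the distribution of $Z = (X, Y)$. We will apply
Proposition 4 to the class $\mathcal{G} = \{g_f : g_f(x, y) = I_{\{y \neq f(x)\}}, f \in \mathcal{F}\}$. Then, clearly,
$P g_f = R(f)$ and $\mathcal{E}_P(g_f) = R(f) - R(f_P^*)$ for $g_f(x, y) = I_{\{y \neq f(x)\}}$, which implies
that

$$\mathcal{F}_P(\delta) = \{g_f : f \in \mathcal{F}, \ R(f) - R(f_P^*) \le \delta\}.$$

We also have $\|g_{f_1} - g_{f_2}\|_{L_2(P)}^2 = \|f_1 - f_2\|_{L_1(\mu_X)}$. Thus, it follows from Lemma 1
that, for all $g_f \in \mathcal{G}$,

$$\mathcal{E}_P(g_f) \ge c_M \|g_f - g_{f_P^*}\|_{L_2(P)}^{2\kappa},$$

and we get a bound on the $L_2(P)$-diameter of the $\delta$-minimal set $\mathcal{F}_P(\delta)$: with
some constant $C_1 > 0$,

$$D(\delta) \le C_1 \delta^{1/(2\kappa)}. \qquad (20)$$

To bound the function $\phi_n(\delta)$, we will apply Lemma 2 to the class $\mathcal{T} = \mathcal{F}_P(\delta)$
with $a = 1$. Note that

$$\mathcal{H}_{[\,]}(\varepsilon, \mathcal{F}_P(\delta), \|\cdot\|_{L_2(P)}) \le 2 \mathcal{H}_{[\,]}(\varepsilon/2, \mathcal{G}, \|\cdot\|_{L_2(P)})$$
$$\le 2 \mathcal{H}_{[\,]}(\varepsilon^2/4, \mathcal{F}, \|\cdot\|_{L_1(\mu_X)})$$
$$\le 2 \mathcal{H}_{[\,]}(\varepsilon^2/(4 c_\mu), \mathcal{F}, \|\cdot\|_{L_1(\mu)}).$$

Using (19) we easily get from Lemma 2 that, with some constants $C_2, C_3 > 0$,

$$\phi_n(\delta) \le C_2 \delta^{\frac{1-\rho}{2\kappa}} n^{-1/2}, \quad \delta \ge C_3 n^{-\frac{\kappa}{1+\rho}},$$

which implies that, with some constant $C_4 > 0$,

$$\phi_n(\delta) \le C_4 \max\big(\delta^{\frac{1-\rho}{2\kappa}} n^{-1/2}, \ n^{-\frac{1}{1+\rho}}\big), \quad \delta > 0.$$

This and (20) lead to the following bound on the function $V_n^t(\delta)$:

$$V_n^t(\delta) \le C_5 \Big[ \delta^{\frac{1-\rho}{2\kappa} - 1} n^{-1/2} \vee \delta^{-1} n^{-\frac{1}{1+\rho}} + \delta^{\frac{1}{2\kappa} - 1} \sqrt{\frac{t}{n}} + \delta^{-1} \frac{t}{n} \Big]$$

that holds with some constant $C_5$. Thus, we end up with a bound on $\sigma_n^t$:

$$\sigma_n^t \le C_6 \Big[ n^{-\frac{\kappa}{2\kappa - 1 + \rho}} \vee n^{-\frac{1}{1+\rho}} \vee \Big(\frac{t}{n}\Big)^{\frac{\kappa}{2\kappa - 1}} \vee \frac{t}{n} \Big]. \qquad (21)$$

Note that, for $\kappa \ge 1$, $\rho < 1$ and $t \le n$, we have

$$n^{-\frac{\kappa}{2\kappa - 1 + \rho}} \ge n^{-\frac{1}{1+\rho}} \quad \text{and} \quad \Big(\frac{t}{n}\Big)^{\frac{\kappa}{2\kappa - 1}} \ge \frac{t}{n}.$$

Therefore, (21) can be simplified as follows:

$$\sigma_n^t \le C_7 \Big[ n^{-\frac{\kappa}{2\kappa - 1 + \rho}} \vee \Big(\frac{t}{n}\Big)^{\frac{\kappa}{2\kappa - 1}} \Big], \qquad (22)$$

and the result immediately follows from Proposition 4. □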
Note that Theorem 2 remains valid if we drop condition (i) and replace
(iii) by the following more general condition:

(iii') The Bayes classifier $f_P^*$ belongs to a known class of prediction rules $\mathcal{F}$
satisfying the bracketing entropy bound

$$\mathcal{H}_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_1(\mu_X)}) \le B' \varepsilon^{-\rho}, \quad \forall \varepsilon > 0. \qquad (23)$$

Condition (iii') is, in fact, an assumption on both $\mathcal{F}$ and the class of
possible marginal densities $\mu_X$. The reason why we have introduced conditions
(i) and (iii) instead of (iii') is that they are easily interpretable. Indeed, in
this way we decouple assumptions on $\mathcal{F}$ and $\mu_X$. The case that is even
easier corresponds to considering a subclass of $\mathcal{M}^*(\rho, \alpha)$ composed of measures
$P \in \mathcal{M}^*(\rho, \alpha)$ with the same marginal $\mu_X$. Then again we only need to
assume (ii) and (iii'), but now (iii') should hold for one fixed measure $\mu_X$ and
not simultaneously for a set of possible marginal measures.

We finish this section by a comparison of Theorems 1 and 2. They differ
in imposing entropy assumptions on different objects, the regression function $\eta_P$
and the Bayes classifier $f_P^*$ respectively. Also, in Theorem 1 the complexity is
measured by the usual entropy for the sup-norm, whereas in Theorem 2 it
is done in terms of the bracketing entropy for the $L_1$-norm. Note that for
many classes the bracketing and the usual $\varepsilon$-entropies behave similarly, so
that the relationship between the corresponding rates of decay $r$ in (14) and
$\rho$ in (19) is only determined by the relationship between the sup-norm of
the regression function $\eta$ and the $L_1$-norm on the induced Bayes classifier.
In this respect, Corollary 1 is insightful, suggesting the correspondence $\rho =
r/\alpha$. In the next section, we will see that such a correspondence exactly
holds when the regression function $\eta$ belongs to a Hölder class. Finally,
note that the ranges of the margin and complexity parameters as well as the
assumptions on the measure $\mu_X$ in Theorems 1 and 2 are somewhat different.
Namely, Theorem 1 holds under no additional assumption on $\mu_X$ except for
the Margin condition and covers classes with high complexity (all $r > 0$
are allowed). Theorem 2 needs a relatively mild additional assumption (i)
on $\mu_X$ and restricts the complexity by the condition $\rho < 1$. On the other
hand, Theorem 2 establishes the rates under the Margin assumption (10)
with $\kappa = 1$, not covered by Theorem 1. In addition to this, the classifier $\hat f_{n,2}$
of Theorem 2 does not require the knowledge of the margin parameter $\alpha$.
Thus, this method is adaptive to the margin parameter. On the other hand,
the classifier $\hat f_{n,1}$ of Theorem 1 does require the knowledge of $\alpha$, which is
involved in the definition of the parameter $\varepsilon$ of the net $\mathcal{N}_\varepsilon$. Note that for classes
$\mathcal{F}$ of high complexity (with $\rho \ge 1$) the empirical risk minimization over the
whole class $\mathcal{F}$ usually does not provide optimal convergence rates. In such
cases, some form of regularization is needed. It could be based on penalized
empirical risk minimization (see, e.g., [10]) over proper sieves of subclasses
of $\mathcal{F}$ (for instance, sieves of $\varepsilon$-nets for $\mathcal{F}$).



5 Minimax lower bounds 
5.1 A general inequality 

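For two probability measures $\mu$ and $\nu$ on a measurable space $(\mathcal{X}, \mathcal{A})$, we
define the Kullback-Leibler divergence and the $\chi^2$-divergence as follows:

$$\mathcal{K}(\mu, \nu) \triangleq \int_{\mathcal{X}} g \ln g \, d\nu, \qquad \chi^2(\mu, \nu) \triangleq \int_{\mathcal{X}} (g - 1)^2 \, d\nu, \qquad (24)$$

if $\mu$ is absolutely continuous with respect to $\nu$ with Radon-Nikodym
derivative $g = d\mu/d\nu$, and we set $\mathcal{K}(\mu, \nu) \triangleq +\infty$, $\chi^2(\mu, \nu) \triangleq +\infty$ otherwise.
We will use the following auxiliary result.

Lemma 3. Let $(\mathcal{X}, \mathcal{A})$ be a measurable space and let $A_i \in \mathcal{A}$, $i \in \{0, 1, \dots, M\}$,
$M \ge 2$, be such that $\forall i \neq j$, $A_i \cap A_j = \emptyset$. Assume that $Q_i$, $i \in \{0, 1, \dots, M\}$,
are probability measures on $(\mathcal{X}, \mathcal{A})$ such that

$$\frac{1}{M} \sum_{j=1}^{M} \mathcal{K}(Q_j, Q_0) \le \chi < \infty.$$

Then

$$p_* \triangleq \max_{0 \le i \le M} Q_i(\mathcal{X} \setminus A_i) \ge \frac{1}{12} \min\{1, M e^{-3\chi}\}.$$

Proof: Proposition 2.3 in [12] yields:

$$p_* \ge \sup_{0 < \tau < 1} \frac{\tau M}{\tau M + 1} \Big(1 + \frac{\chi + \sqrt{\chi/2}}{\log \tau}\Big).$$

In particular, taking $\tau^* = \min(M^{-1}, e^{-3\chi})$ and using that $\sqrt{6 \log M} \ge 2$ for
$M \ge 2$, we obtain

$$p_* \ge \frac{\tau^* M}{\tau^* M + 1} \Big(1 + \frac{\chi + \sqrt{\chi/2}}{\log \tau^*}\Big) \ge \frac{1}{12} \min\{1, M e^{-3\chi}\}.$$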
We now prove a classification setting analogue of the lower bound
obtained by DeVore et al. [5] in the regression problem.

Theorem 3. Assume that a class $\Theta$ of probability distributions $P$ with the
corresponding regression functions $\eta_P$ and Bayes rules $f_P^*$ (as defined above)
contains a set $\{P_i\}_{i=1}^{N} \subset \Theta$, $N \ge 3$, with the following properties: the marginal
distribution of $X$ is $\mu_X$ for all $P_i$, independently of $i$, where $\mu_X$ is an arbitrary
probability measure, $1/4 \le \eta_{P_i} \le 3/4$, $i = 1, \dots, N$, and for any $i \neq j$

$$\|\eta_{P_i} - \eta_{P_j}\|_{L_2(\mu_X)} \le \gamma, \qquad (25)$$
$$\|f_{P_i}^* - f_{P_j}^*\|_{L_1(\mu_X)} \ge s \qquad (26)$$

with some $\gamma > 0$, $s > 0$. Then for any classifier $\hat f_n$ we have

$$\max_{1 \le k \le N} P_k^n\big\{\|\hat f_n - f_{P_k}^*\|_{L_1(\mu_X)} \ge s/2\big\} \ge \frac{1}{12} \min\big(1, (N - 1) \exp\{-12 n \gamma^2\}\big), \qquad (27)$$

where $P_k^n$ denotes the product probability measure associated to the i.i.d.
$n$-sample from $P_k$.
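
The right-hand side of (27) is easy to evaluate, and doing so shows when the lower bound is non-trivial. The Python sketch below is only an illustration; the function name and the numerical values of $n$, $N$, $\gamma$ are assumptions chosen for the example. The bound stays at its maximal value $1/12$ as long as $12 n \gamma^2 \le \log(N - 1)$, which is the trade-off between the number of hypotheses and their closeness exploited in the constructions that follow.

```python
import math


def theorem3_lower_bound(n, num_hypotheses, gamma):
    """Right-hand side of (27): (1/12) * min(1, (N - 1) * exp(-12 * n * gamma^2))."""
    return min(1.0, (num_hypotheses - 1) * math.exp(-12.0 * n * gamma ** 2)) / 12.0


def max_gamma_for_constant_bound(n, num_hypotheses):
    """Largest gamma for which the bound in (27) still equals 1/12."""
    return math.sqrt(math.log(num_hypotheses - 1) / (12.0 * n))


n, num_hypotheses = 1000, 1025  # illustrative values
gamma_star = max_gamma_for_constant_bound(n, num_hypotheses)
print("threshold gamma:", gamma_star)
print("bound at gamma*/2:", theorem3_lower_bound(n, num_hypotheses, gamma_star / 2))
print("bound at 3*gamma*:", theorem3_lower_bound(n, num_hypotheses, 3 * gamma_star))
```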

Proof: We apply Lemma 3 where we set $Q_i = P_i^n$, $M = N - 1$, and define
the random events $A_i$ as follows:

$$A_i = \{\mathcal{D}_n : \|\hat f_n - f_{P_i}^*\|_{L_1(\mu_X)} < s/2\}, \quad i = 1, \dots, N.$$

The events $A_i$ are disjoint because of (26). Thus, the theorem follows from
Lemma 3 if we prove that $\mathcal{K}(P_i^n, P_j^n) \le 4 n \gamma^2$ for all $i, j$.

Let us evaluate $\mathcal{K}(P_i^n, P_j^n)$. For each $\eta_{P_i}$, the corresponding measure $P_i$ is
determined as follows:

$$dP_i(x, y) \triangleq \big(\eta_{P_i}(x) \, d\delta_1(y) + (1 - \eta_{P_i}(x)) \, d\delta_0(y)\big) \, d\mu_X(x), \qquad (28)$$

where $d\delta_\xi$ denotes the Dirac measure with unit mass at $\xi$. Set for brevity
$\eta_i = \eta_{P_i}$. Fix $i$ and $j$. We have $dP_i(x, y) = g(x, y) \, dP_j(x, y)$, where

$$g(x, 1) = \frac{\eta_i(x)}{\eta_j(x)}, \qquad g(x, 0) = \frac{1 - \eta_i(x)}{1 - \eta_j(x)}.$$

Therefore, using the inequalities $1/4 \le \eta_i, \eta_j \le 3/4$ and (25) we find

$$\chi^2(P_i, P_j) = \int \Big[ \frac{(\eta_i(x) - \eta_j(x))^2}{\eta_j(x)} + \frac{(\eta_i(x) - \eta_j(x))^2}{1 - \eta_j(x)} \Big] d\mu_X(x) \le 8 \|\eta_i - \eta_j\|_{L_2(\mu_X)}^2 \le 8 \gamma^2. \qquad (29)$$

Together with the inequality between the Kullback and $\chi^2$-divergences, cf. [15],
p. 134, this yields

$$\mathcal{K}(P_i^n, P_j^n) = n \mathcal{K}(P_i, P_j) \le n \chi^2(P_i, P_j)/2 \le 4 n \gamma^2.$$

□



5.2 Construction of a finite family of measures 

Theorem 3 can be applied in various situations by choosing suitable probability measures $P_i$, $i = 1, \dots, N$. In this section, we suggest such a particular choice, which will give lower bounds for classification.

Let $\sigma = (\sigma_1, \dots, \sigma_b)$ be a binary vector of length $b$ with elements $\sigma_j \in \{-1, 1\}$. Let $\varphi$ be an infinitely differentiable function with compact support in $\mathbb{R}^d$ such that $0 \le \varphi(x) \le c$ for some constant $c \in (0, 1/2)$. Consider functions $\varphi_1, \dots, \varphi_b$ on $\mathbb{R}^d$ satisfying:

a) $\varphi_j$ is a shift of $\varphi$, $j = 1, \dots, b$,

b) the supports $\Delta_j$ of functions $\varphi_j$ are disjoint.

Denote by $\Sigma(b)$ the set of all binary vectors $\sigma$ of length $b$. For every $\sigma \in \Sigma(b)$ define

$$\psi_\sigma(x) = \sum_{j=1}^{b} \sigma_j \varphi_j(x), \qquad \eta_\sigma(x) = (1 + \psi_\sigma(x))/2.$$

Consider the following class $\mathcal{E}$ of regression functions:

$$\mathcal{E} = \{\eta_\sigma, \ \sigma \in \Sigma(b)\}.$$

In what follows we assume without loss of generality that $b \ge 16$. By the Varshamov-Gilbert lemma (cf. [15], p. 104), there is a subset $S$ of $\Sigma(b)$ such that cardinality $|S| \ge 2^{b/8}$, and for any two different elements $\sigma$ and $\sigma'$ from $S$ we have

$$\|\sigma - \sigma'\|_{\ell_1} \ge b/4. \qquad (30)$$
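As a small illustration of the kind of packing the Varshamov-Gilbert lemma guarantees, the sketch below greedily searches $\{-1, 1\}^b$ for vectors at pairwise Hamming distance at least $b/8$, which for $\pm 1$ coordinates is the same as $\ell_1$ distance at least $b/4$, and stops once $2^{b/8}$ vectors are found. The function names and the brute-force search are illustrative assumptions, not the argument of [15], and the search is only practical for small $b$.

```python
from itertools import combinations, product
from math import ceil


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(a != c for a, c in zip(u, v))


def greedy_separated_subset(b):
    """Greedily pick vectors of {-1, 1}^b with pairwise Hamming distance >= b/8.

    Stops as soon as ceil(2^(b/8)) vectors have been collected.
    """
    target = ceil(2 ** (b / 8))
    min_dist = b / 8
    chosen = []
    for cand in product((-1, 1), repeat=b):
        if all(hamming(cand, s) >= min_dist for s in chosen):
            chosen.append(cand)
            if len(chosen) >= target:
                break
    return chosen


S = greedy_separated_subset(16)
# Smallest pairwise l1 distance; each differing +-1 coordinate contributes 2.
min_l1 = min(
    sum(abs(a - c) for a, c in zip(u, v)) for u, v in combinations(S, 2)
)
```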

Let $\mathcal{X} = [0, 1]^d$, $q \in \mathbb{N}$, and $b = q^d$. Let $\psi$ be a nonnegative infinitely differentiable function with support $(0, 1)^d$ such that $\psi \le c < 1/2$ and $\int \psi(x)\,dx > 0$. For given parameters $\delta \in (0, 1)$ (small parameter) and $\alpha \in [0, \infty)$, define

$$\varphi(x) \triangleq \delta^{1/(1+\alpha)} \psi(qx).$$

For a vector $k = (k_1, \dots, k_d)$, $k_j \in \{0, \dots, q - 1\}$, $j = 1, \dots, d$, define a grid point

$$x^k = (x_1^k, \dots, x_d^k), \qquad x_j^k = k_j/q, \quad j = 1, \dots, d.$$

We now consider $b$ functions $\varphi_k(x) = \varphi(x - x^k)$ and the corresponding class $\mathcal{E}$ of regression functions defined above. We set $N = |S|$ and consider a subset $\mathcal{E}' \subset \mathcal{E}$:

$$\mathcal{E}' \triangleq \{\eta_\sigma, \ \sigma \in S\} = \{\eta_i\}_{i=1}^{N}.$$

Now, recalling that the regression function $\eta(X)$ is the conditional probability of $Y = 1$ given $X$, we define the joint probability measures $P_\sigma$, $\sigma \in S$, of $(X, Y)$ (these measures will be also denoted by $P_i$, $i = 1, \dots, N$):

$$P_\sigma(Y = 1, X \in A) = \int_A \eta_\sigma(x)\,\mu_X(dx)$$

for any Borel set $A$, where the marginal distribution $\mu_X = \mu_X^*$ is specified as follows. First, for all $x$ such that

$$1/(4q) \le x_j - x_j^k \le 3/(4q), \quad j = 1, \dots, d,$$

the distribution $\mu_X^*$ has a density w.r.t. the Lebesgue measure

$$\frac{d\mu_X^*(x)}{dx} = \frac{w}{\mathrm{Leb}(B(0, 1/(4q)))} = 2^d b w, \qquad (31)$$

where $B(x, r)$ is the $\ell_\infty$-ball of radius $r$ centered at $x$, $\mathrm{Leb}(\cdot)$ denotes the Lebesgue measure, and $w = C\delta^{\alpha/(1+\alpha)}/b$ for some $C \in (0, 1]$. Second, we set $d\mu_X^*(x)/dx = 0$ for all other $x$ such that at least one of $\eta_i(x)$ is not $1/2$. Finally, on the complementary set $A_0 \subset [0, 1]^d$ where all $\eta_i(x)$ are equal to $1/2$, we set $d\mu_X^*(x)/dx = (1 - bw)/\mathrm{Leb}(A_0)$ to ensure that $\int_{[0,1]^d} d\mu_X^*(x) = 1$ (we assume that the support of the function $\psi$ belongs to the set $[y, 1 - y]^d$ for a small $y > 0$; then, it is easy to see that $\mathrm{Leb}(A_0) > 0$).
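The bookkeeping of masses in this construction can be checked with a few lines of arithmetic. The sketch below, with hypothetical parameter values and an illustrative function name, computes $w$, the density on each central cube, and the total mass. It confirms that the density equals $2^d b w$, that each cube carries mass $w$, and that the $b$ cubes together with $A_0$ carry total mass 1.

```python
from math import isclose


def marginal_masses(q, d, delta, alpha, C):
    """Masses of the marginal: b central cubes plus the complementary set A_0."""
    b = q ** d
    w = C * delta ** (alpha / (1 + alpha)) / b
    # Lebesgue measure of the l_infinity ball B(0, 1/(4q)): a cube of side 1/(2q).
    cube_volume = (1 / (2 * q)) ** d
    density = w / cube_volume
    mass_per_cube = density * cube_volume
    return {
        "b": b,
        "w": w,
        "density": density,
        "mass_per_cube": mass_per_cube,
        "mass_on_cubes": b * mass_per_cube,
        "mass_on_A0": 1 - b * w,
    }


# Hypothetical parameter values, chosen only for illustration.
m = marginal_masses(q=4, d=2, delta=0.1, alpha=1.0, C=0.5)
```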

We now impose an extra restriction on $\varphi$ and prove that under this restriction the measures $P_i$ satisfy the Margin condition with parameter $\alpha$. Assume that $\psi(x) = c_2 > 0$ for $x$ satisfying the inequalities $1/4 \le x_j \le 3/4$, $j = 1, \dots, d$, and $\psi(x) \le c_2$ for other $x$. Here $c_2 \in (0, 1/2)$. Then

$$\mu_X^*\big(0 < |\eta_\sigma(X) - 1/2| \le t\big) = \mu_X^*\Big(0 < \Big|\sum_{j=1}^{b} \sigma_j \varphi_j(X)\Big| \le 2t\Big) = b\,\mu_X^*\big(0 < \varphi(X) \le 2t\big),$$

because the supports $\Delta_j$ of functions $\varphi_j$ are disjoint. Then, using the definition $\varphi(x) = \delta^{1/(1+\alpha)}\psi(qx)$ we obtain that

$$\mu_X^*\big(0 < \varphi(X) \le 2t\big) = w \quad \text{if } c_2\delta^{1/(1+\alpha)} \le 2t$$

and $\mu_X^*(0 < \varphi(X) \le 2t) = 0$ for all other $t > 0$. Therefore,

$$b\,\mu_X^*\big(0 < \varphi(X) \le 2t\big) \le C\delta^{\alpha/(1+\alpha)} I\{c_2\delta^{1/(1+\alpha)} \le 2t\} \le C(2t/c_2)^{\alpha}, \quad t > 0.$$

Thus, all $P_i$ satisfy the Margin condition with parameter $\alpha$ and constant $C_M = C(2/c_2)^{\alpha}$.
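The last inequality of this computation is easy to verify numerically. The sketch below is an illustration only, with hypothetical values for $C$, $c_2$ and the grids: it compares the mass $C\delta^{\alpha/(1+\alpha)} I\{c_2\delta^{1/(1+\alpha)} \le 2t\}$ with the margin bound $C(2t/c_2)^{\alpha}$ over a range of $\alpha$, $\delta$ and $t$.

```python
def margin_mass(t, delta, alpha, C, c2):
    """Mass of {0 < |eta - 1/2| <= t}: C*delta^(alpha/(1+alpha)) past the threshold, else 0."""
    threshold = c2 * delta ** (1 / (1 + alpha))
    return C * delta ** (alpha / (1 + alpha)) if threshold <= 2 * t else 0.0


def margin_bound(t, alpha, C, c2):
    """Right-hand side C_M * t^alpha with C_M = C * (2 / c2)^alpha."""
    return C * (2 * t / c2) ** alpha


def margin_condition_holds(alphas, deltas, ts, C=1.0, c2=0.25, tol=1e-12):
    """Check mass <= bound for every combination of the given grids."""
    return all(
        margin_mass(t, dl, a, C, c2) <= margin_bound(t, a, C, c2) + tol
        for a in alphas
        for dl in deltas
        for t in ts
    )


# Hypothetical grids, chosen only for illustration.
ok = margin_condition_holds(
    alphas=[0.0, 0.5, 1.0, 2.0, 5.0],
    deltas=[k / 20 for k in range(1, 20)],
    ts=[k / 200 for k in range(1, 101)],
)
```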



5.3 Minimax lower bound for classification 



Let us check the assumptions of Theorem 3 for the set of probability measures $P_1, \dots, P_N$ defined in Section 5.2. Since $\varphi \le c < 1/2$ we have $1/4 \le \eta_i(x) \le 3/4$ for all $\delta \in (0, 1)$ and all $x \in (0, 1)^d$. Next, for any $\sigma, \sigma' \in S$ we have

$$\|\eta_\sigma - \eta_{\sigma'}\|^2_{L_2(\mu_X^*)} \le C\delta^{(2+\alpha)/(1+\alpha)} \qquad (32)$$

and for $\sigma \ne \sigma'$, in view of (30) and (31),

$$\|f^*_{P_\sigma} - f^*_{P_{\sigma'}}\|_{L_1(\mu_X^*)} \ge C_1\delta^{\alpha/(1+\alpha)} \qquad (33)$$

where $C_1 = C/4$. Thus, the assumptions of Theorem 3 are satisfied with $N = |S| \ge 2^{b/8} \ge 2^{b/16} + 1$, and

$$\gamma^2 = C\delta^{(2+\alpha)/(1+\alpha)}, \qquad s = C_1\delta^{\alpha/(1+\alpha)}. \qquad (34)$$

Therefore, we get the following result.

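Proposition 5. Fix $\alpha > 0$, $\delta \in (0, 1)$ and $q \in \mathbb{N}$ such that $b = q^d \ge 16$. Let $P_1, \dots, P_N$ be the family of probability measures defined in Section 5.2. Then for any classifier $\hat f_n$ we have

$$\max_{1 \le k \le N} P_k^{\otimes n}\Big\{\|\hat f_n - f^*_{P_k}\|_{L_1(\mu_X^*)} \ge \frac{C_1\delta^{\alpha/(1+\alpha)}}{2}\Big\} \ge \frac{1}{12}\min\big(1, 2^{q^d/16}\exp\{-C_3 n \delta^{(2+\alpha)/(1+\alpha)}\}\big) \qquad (35)$$

where $C \in (0, 1)$ is the constant used in the construction of Section 5.2, and $C_3 > 0$ is a constant depending only on $C$. Furthermore, for $0 < \Delta < \Delta_0$,

$$\max_{1 \le k \le N} P_k^{\otimes n}\big\{R(\hat f_n) - R(f^*_{P_k}) \ge \Delta\big\} \ge \frac{1}{12}\min\big(1, 2^{q^d/16}\exp\{-C_4 n \Delta^{(2+\alpha)/(1+\alpha)}\}\big) \qquad (36)$$

where $\Delta_0 = 16^{-(1+\alpha)/\alpha} C c_2$, and $C_4 > 0$ is a constant depending only on $C$, $c_2$ and $\alpha$.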

Proof: Bound (35) follows from Theorem 3 and (34). To prove (36), we combine (35) with Lemma 1, set $\Delta = \Delta_0\delta$, and use that $C_M = C(2/c_2)^{\alpha}$ by the construction of Section 5.2.
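The value of $\Delta_0$ can be traced through this proof numerically. The sketch below assumes, as a labeled assumption since Lemma 1 is stated elsewhere in the paper, that Lemma 1 takes the standard form $R(f) - R^* \ge (2C_M)^{-1/\alpha}\|f - f^*\|_{L_1(\mu_X)}^{(1+\alpha)/\alpha}$. Under that assumption, plugging in the $L_1$ level $s/2$ with $s = C_1\delta^{\alpha/(1+\alpha)}$, $C_1 = C/4$, and $C_M = C(2/c_2)^{\alpha}$ gives exactly $\Delta_0\delta$ with $\Delta_0 = 16^{-(1+\alpha)/\alpha} C c_2$. The function names and parameter values are hypothetical.

```python
from math import isclose


def delta0(alpha, C, c2):
    """The constant Delta_0 = 16^(-(1+alpha)/alpha) * C * c2 of Proposition 5."""
    return 16 ** (-(1 + alpha) / alpha) * C * c2


def excess_risk_floor(delta, alpha, C, c2):
    """Excess-risk level implied by an L1 distance of s/2.

    ASSUMPTION: Lemma 1 has the form
    R(f) - R* >= (2 * C_M)^(-1/alpha) * ||f - f*||^((1+alpha)/alpha).
    """
    C_M = C * (2 / c2) ** alpha  # margin constant from the construction
    half_s = (C / 4) * delta ** (alpha / (1 + alpha)) / 2  # s/2 with C_1 = C/4
    return (2 * C_M) ** (-1 / alpha) * half_s ** ((1 + alpha) / alpha)


# Hypothetical parameter values, chosen only for illustration.
lhs = excess_risk_floor(delta=0.3, alpha=1.5, C=0.5, c2=0.25)
rhs = delta0(alpha=1.5, C=0.5, c2=0.25) * 0.3
```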

5.4 Application to a particular class of distributions 

In this section, we will assume that the regression function $\eta$ belongs to a Hölder class defined as follows.

For any multi-index $s = (s_1, \dots, s_d)$ and any $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, we define $|s| = \sum_{i=1}^{d} s_i$, $s! = s_1! \cdots s_d!$, $x^s = x_1^{s_1} \cdots x_d^{s_d}$ and $\|x\| = (x_1^2 + \cdots + x_d^2)^{1/2}$. Let $D^s$ denote the differential operator

$$D^s = \frac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}.$$

For $\beta > 0$, let $\lfloor\beta\rfloor$ be the maximal integer that is strictly less than $\beta$. For any $x \in [0, 1]^d$ and any $\lfloor\beta\rfloor$ times continuously differentiable real valued function $g$ on $[0, 1]^d$, we denote by $g_x$ its Taylor polynomial of degree $\lfloor\beta\rfloor$ at point $x \in [0, 1]^d$:

$$g_x(x') \triangleq \sum_{|s| \le \lfloor\beta\rfloor} \frac{(x' - x)^s}{s!} D^s g(x).$$

Let $\beta > 0$, $L > 0$. The Hölder class of functions $\Sigma(\beta, L, [0, 1]^d)$ is defined as the set of all functions $g : [0, 1]^d \to \mathbb{R}$ that are $\lfloor\beta\rfloor$ times continuously differentiable and satisfy, for any $x, x' \in [0, 1]^d$, the inequality

$$|g(x') - g_x(x')| \le L\|x' - x\|^{\beta}.$$
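A one-dimensional example may help fix the convention that $\lfloor\beta\rfloor$ is the largest integer strictly less than $\beta$. The sketch below takes the illustrative function $g(u) = u^2$ on $[0, 1]$ (a choice made here, not taken from the paper) and checks the defining inequality on a grid. For $\beta = 2$ the Taylor polynomial has degree 1, the remainder is $(x' - x)^2$, and the inequality holds with $L = 1$ but fails with $L = 1/2$.

```python
from math import ceil


def strict_floor(beta):
    """Maximal integer strictly less than beta."""
    return ceil(beta) - 1


def taylor_poly_square(x, x_prime, degree):
    """Taylor polynomial at x of g(u) = u**2, truncated at `degree`, evaluated at x_prime."""
    terms = [x ** 2, 2 * x * (x_prime - x), (x_prime - x) ** 2]
    return sum(terms[: degree + 1])


def holder_gap_ok(beta, L, steps=20, tol=1e-12):
    """Check |g(x') - g_x(x')| <= L * |x' - x|^beta on a grid of [0, 1] for g(u) = u**2."""
    m = strict_floor(beta)
    pts = [k / steps for k in range(steps + 1)]
    return all(
        abs(xp ** 2 - taylor_poly_square(x, xp, m)) <= L * abs(xp - x) ** beta + tol
        for x in pts
        for xp in pts
    )
```

With this convention, `strict_floor(2)` is 1 and `strict_floor(1)` is 0, so integer values of $\beta$ use a Taylor polynomial of degree $\beta - 1$.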

We now apply the technique of proving minimax lower bounds developed in the previous sections to the following class of distributions.

Fix $\alpha > 0$, $\beta > 0$, $L > 0$, and a probability distribution $\mu_X$ on $[0, 1]^d$. Denote by $\mathcal{M}'(\mu_X, \alpha, \beta)$ the class of all joint distributions $P$ of $(X, Y)$ such that:

(i) The marginal distribution of $X$ is $\mu_X$;

(ii) The Margin condition is satisfied with some constant $C_M > 0$;

(iii) The regression function $\eta = \eta_P$ belongs to the Hölder class $\Sigma(\beta, L, [0, 1]^d)$.

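Theorem 4. Let $\mu_X^*$ be the marginal density defined in Section 5.2. There exist positive constants $C_1'$, $C_2'$, $c'$ and $d_1'$, $d_2'$, $\Delta_0$ depending only on $\alpha$, $\beta$, $L$, $d$, and $C_M$ such that for any classifier $\hat f_n$,

$$\sup_{P \in \mathcal{M}'(\mu_X^*, \alpha, \beta)} P\{R(\hat f_n) - R^* \ge \Delta\} \ge C_1'$$

for any $0 < \Delta \le d_1' n^{-\frac{1+\alpha}{2+\alpha+d/\beta}}$, and

$$\sup_{P \in \mathcal{M}'(\mu_X^*, \alpha, \beta)} P\{R(\hat f_n) - R^* \ge \Delta\} \ge C_2' \exp\{-c' n \Delta^{\frac{2+\alpha}{1+\alpha}}\}$$

for any $d_2' n^{-\frac{1+\alpha}{2+\alpha+d/\beta}} \le \Delta \le \Delta_0$.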

Proof: Set $q = \lceil C_5 \delta^{-\frac{1}{(1+\alpha)\beta}} \rceil$ where $C_5 > 0$ is a constant, and $\lceil x \rceil$ denotes the minimal integer greater than $x$. It is easy to see that if $C_5$ is small enough, then (see Section 5.2) we have $\varphi \in \Sigma(\beta, L, [0, 1]^d)$ implying that $\eta_\sigma \in \Sigma(\beta, L, [0, 1]^d)$ for all $\sigma \in S$. Choose such a small $C_5$. It is also easy to see that one can always choose constants $C \in (0, 1)$ and $c_2 \in (0, 1/2)$ in the construction of Section 5.2 in such a way that $C(2/c_2)^{\alpha} \le C_M$, which is needed to satisfy the margin condition (ii). Then, for any fixed $\delta \in (0, 1)$, the finite family of probability distributions $\{P_1, \dots, P_N\}$ constructed in Section 5.2 and depending on $\delta$ belongs to $\mathcal{M}'(\mu_X^*, \alpha, \beta)$. To indicate this dependence on $\delta$ explicitly, denote this family by $\mathcal{P}_\Delta$ where $\Delta = \Delta_0\delta$ and $\Delta_0$ is defined in Proposition 5. Since $\mathcal{P}_\Delta \subset \mathcal{M}'(\mu_X^*, \alpha, \beta)$ for any $\Delta < \Delta_0$ we can write

$$\sup_{P \in \mathcal{M}'(\mu_X^*, \alpha, \beta)} P\{R(\hat f_n) - R^* \ge \Delta\} \ge \max_{P \in \mathcal{P}_\Delta} P\{R(\hat f_n) - R^* \ge \Delta\}$$

and then estimate the right hand side of this inequality using (36) of Proposition 5. Note that in Proposition 5 we have the assumption $q^d \ge 16$, which is satisfied if $\delta \le \delta_0$ where $\delta_0$ is a small enough constant depending only on the constants in the definition of the class $\mathcal{M}'(\mu_X^*, \alpha, \beta)$. Thus we obtain

$$\sup_{P \in \mathcal{M}'(\mu_X^*, \alpha, \beta)} P\{R(\hat f_n) - R^* \ge \Delta\} \ge \frac{1}{12}\min\big(1, 2^{q^d/16}\exp\{-C_4 n \Delta^{\frac{2+\alpha}{1+\alpha}}\}\big) \ge \frac{1}{12}\min\big(1, \exp\{c_6 \Delta^{-\frac{d}{(1+\alpha)\beta}} - C_4 n \Delta^{\frac{2+\alpha}{1+\alpha}}\}\big)$$

for all $0 < \Delta < \Delta_0$ where $\Delta_0 > 0$ and $c_6 > 0$ depend only on the constants in the definition of the class $\mathcal{M}'(\mu_X^*, \alpha, \beta)$. This immediately implies the theorem. □
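The two regimes of the theorem come from comparing the two terms in the last exponent. The sketch below, with the constants $c_6$ and $C_4$ set to 1 purely for illustration and with hypothetical function names, computes the level $n^{-(1+\alpha)/(2+\alpha+d/\beta)}$ and shows that the growth term $\Delta^{-d/((1+\alpha)\beta)}$ and the decay term $n\Delta^{(2+\alpha)/(1+\alpha)}$ balance exactly there. Below that level the growth term dominates, so the minimum is 1 and the bound is a constant; above it the decay term dominates and the bound is exponentially small.

```python
from math import isclose


def critical_level(n, alpha, beta, d):
    """Level n^(-(1+alpha)/(2+alpha+d/beta)) separating the two regimes."""
    return n ** (-(1 + alpha) / (2 + alpha + d / beta))


def exponent_terms(n, Delta, alpha, beta, d, c6=1.0, C4=1.0):
    """Growth term c6*Delta^(-d/((1+alpha)*beta)) and decay term C4*n*Delta^((2+alpha)/(1+alpha))."""
    growth = c6 * Delta ** (-d / ((1 + alpha) * beta))
    decay = C4 * n * Delta ** ((2 + alpha) / (1 + alpha))
    return growth, decay


# Hypothetical values: n = 10000, alpha = 1, beta = 2, d = 2 give the level 0.01.
n, alpha, beta, d = 10_000, 1.0, 2.0, 2
crit = critical_level(n, alpha, beta, d)
g_at, d_at = exponent_terms(n, crit, alpha, beta, d)
g_lo, d_lo = exponent_terms(n, 0.5 * crit, alpha, beta, d)
g_hi, d_hi = exponent_terms(n, 2.0 * crit, alpha, beta, d)
```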

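Note that the class of distributions $\mathcal{M}'(\mu_X^*, \alpha, \beta)$ has the following properties.

(A) There exists a constant $B > 0$ such that the set of regression functions $\mathcal{U} = \{\eta_P, \ P \in \mathcal{M}'(\mu_X^*, \alpha, \beta)\}$ satisfies the entropy bound

$$\mathcal{H}(\varepsilon, \mathcal{U}, \|\cdot\|_\infty) \le B\varepsilon^{-r}, \quad \forall \varepsilon > 0, \qquad (37)$$

where $r = d/\beta$.

(B) There exists a constant $B' > 0$ such that the set of Bayes classifiers $\mathcal{F} = \{f^*_P, \ P \in \mathcal{M}'(\mu_X^*, \alpha, \beta)\}$ satisfies the bracketing entropy bound

$$\mathcal{H}_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|_{L_1(\mu_X^*)}) \le B'\varepsilon^{-\rho}, \quad \forall \varepsilon > 0, \qquad (38)$$

where $\rho = d/(\alpha\beta)$.

Indeed, (A) holds since $\mathcal{U} = \{\eta \in \Sigma(\beta, L, [0, 1]^d) : 0 \le \eta(x) \le 1\}$, and

$$\mathcal{H}(\varepsilon, \Sigma(\beta, L, [0, 1]^d), \|\cdot\|_\infty) \le B\varepsilon^{-d/\beta},$$

cf. Kolmogorov and Tikhomirov [8]. Moreover, this bound holds if we replace the $\varepsilon$-entropy $\mathcal{H}(\cdot, \cdot, \cdot)$ by the bracketing $\varepsilon$-entropy $\mathcal{H}_{[\,]}(\cdot, \cdot, \cdot)$ depending on the same arguments, cf. Dudley. This and Corollary 1 imply (B).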

In conclusion, for the choice of $\mu_X^*$ described in Section 5.2, the class of probability distributions $\mathcal{M}'(\mu_X^*, \alpha, \beta)$ is a particular case of both $\mathcal{M}(r, \alpha)$ (with $r = d/\beta$) and $\mathcal{M}^*(\rho, \alpha)$ (with $\rho = d/(\alpha\beta)$ and $\mu = \mu_X^*$) defined in Sections 3 and 4. Theorem 4 shows that, for this particular case, it is impossible to obtain faster rates for AC-functions than those established in Theorems 1 and 2. In this sense, Theorem 4 provides a lower bound that matches the upper bounds of Theorems 1 and 2.